Releases: huggingface/optimum-habana
v1.17.0: Transformers v4.49
Transformers v4.49
This release has been tested and validated for Transformers v4.49 and SynapseAI v1.20.
Model optimizations
- Use token_idx_cpu int instead of token_idx tensor in slicing #1848 @jaygala223
- Keep logits in bf16 #1835 @jaygala223
- Optimize SD3 pipeline: pad prompt embeddings for softmax_hf8 compatibility and efficient utilization #1816 @deepak-gowda-narayana
- Add G3 perf WA for Qwen2VL #1884 @nngokhale
- Fix MPT regression #1857 @atakaha
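As a rough illustration of the #1848 change (a sketch, not the actual modeling code): slicing with a plain Python int instead of a 0-dim device tensor keeps the slice static and avoids a host-device round-trip on every decode step.

```python
# Illustrative only: why a Python int index is cheaper than a tensor index.
import torch

logits = torch.randn(4, 128, 32000)  # [batch, seq, vocab] stand-in

token_idx = torch.tensor(17)  # 0-dim tensor index: can force syncs/recompiles
token_idx_cpu = 17            # plain int: static, free slicing

next_token_logits = logits[:, token_idx_cpu - 1, :]
```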
Tests and CI
- Slow test updates #1804 @ugolowic
- Fix race condition when downloading nltk tokenizer #1802 @ugolowic
- fea(): Skipped the torch_fx tests #1797 @imangohari1
- Upstream tests #1834 @IlyasMoutawwakil
- test_examples: add missing clip-roberta baseline #1852 @uartie
- Separate slow tests by required number of cards #1803 @ugolowic
- Update PR doc build workflow #1904 @regisss
Other
- Disable HPU migration (future add-on to HF diffusers) for OH diffusers #1866 @dsocek
- Allow explicit control over flash_attention_fast_softmax setting #1851 @astachowiczhabana
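#1851 above makes the fast-softmax path of fused flash attention an explicit choice. A minimal sketch of driving it through the Gaudi generation config (model and values illustrative; attribute names assumed to match the text-generation example):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi
from optimum.habana.transformers.generation import GaudiGenerationConfig

adapt_transformers_to_gaudi()  # patch Transformers with Gaudi-optimized paths

name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).to("hpu")
tokenizer = AutoTokenizer.from_pretrained(name)

gen_config = GaudiGenerationConfig(
    max_new_tokens=32,
    use_flash_attention=True,             # fused SDPA kernel
    flash_attention_fast_softmax=False,   # explicit control added in #1851
)
inputs = tokenizer("Hello", return_tensors="pt").to("hpu")
print(tokenizer.decode(model.generate(**inputs, generation_config=gen_config)[0]))
```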
v1.16.0: Deepseek V3, SynapseAI v1.20, Llama 405b, AWQ
SynapseAI v1.20
This release has been tested and validated for SynapseAI v1.20.
New models
- Add Qwen2-VL #1542 @nngokhale
- Add video-llava model support #1522 @kaixuanliu
- Enable the i2vgen pipeline #1670 @yuanwu2017
- DeepSeek_v3 support #1735 @srajabos
Llama 405b
- Enable Llama 3.1 405B in FP8 #1745 @jaygala223
- v1.16 Llama3-405B text-generation. Added DEEPSPEED_USE_HABANA_FRAMEWORKS_DETERMINISTIC_API flag. #1812 @dsmertin
- Revert placing llama on cpu #1827 @ugolowic
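The determinism flag added in #1812 is an environment variable; as a sketch, it would be set before DeepSpeed initializes (the exact consumer of the flag is assumed):

```python
import os

# Set before deepspeed / habana_frameworks are imported (assumption).
os.environ["DEEPSPEED_USE_HABANA_FRAMEWORKS_DETERMINISTIC_API"] = "1"
```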
AWQ
- Enable awq int4 in Gaudi #1691 @sywangyi
- Fix dependency issue with --load_quantized_model_with_autoawq #1759 @schoi-habana
Various model optimizations
- Optimizations and WAs to support HPU execution for Detr-Resnet-50 #1334 @sandeep-maddipatla
- Optimized DeepSeek-v2 on Gaudi #1677 @gyou2021
- Add xlm-roberta model support for tei-gaudi use case #1715 @kaixuanliu
- Optimized SD3 pipeline #1682 @deepak-gowda-narayana
- Add clear hpu cache flag for stable perf #1634 @jaygala223
- Fix graph breaks in Mixtral #1705 @ShengYang1
- Add batch splitting in attention layer to hide NIC latency #1640 @kalyank007
- Fix llama FP8 perf issue, kvcache.update should be used since FP8 patches KVCache #1756 @sywangyi
- Add HPU fp8 Dynamic MOE #1761 @dudilester
Sentence Transformers
CI
Other
- Fixed formatting #1693 @imangohari1
- Fix FLUX.1_dev guidance_batches bug for pad case in _split_inputs_into_batches #1607 @huijuanzh
- Fix peft error in Gaudi1 #1627 @sywangyi
- Update README.md #1678 @skaulintel
- Fix custom ops loading in diffusers #1655 @dsocek
- Fix ddpo finetune issue in torch2.5.1 #1666 @sywangyi
- Adding Deepspeed zero1 config #1675 @bhargaveede
- Enable warmup also for full prompt length case in text generation #1676 @yeonsily
- Add padding to input for mllama/paligemma/idefics2 #1671 @sywangyi
- Fix for Mixtral G1 pytest failures #1652 @12010486
- Fix textual_inversion_sdxl failure on docker 1.20 #1697 @atakaha
- Updated Encoder_decoder Tests #1688 @slokesha
- Add checks for parallel_state initialization #1680 @yafshar
- Update the readme to remove validated models #1703 @jiminha
- FP8 baichuan-13b gets OOM when running lm_eval @Liangyx2
- Lm eval upgraded to 0.4.7 #1692 @12010486
- Enable attention selection in wav2vec-ac #1713 @ugolowic
- Fix bug when preparing quant files, starcoder model does not support #1672 @kaixuanliu
- Update training pytests to reduce total time #1712 @jiminha
- Dropping some ci tests from image_to_text and text_generation #1710 @hsubramony
- Add save_checkpoint arg for TIMM training to simplify validation #1701 @ZhengHongming888
- Added Unit Test for Gemma-2-27b model #1616 @slokesha
- Update TRL README.md to clean up models #1706 @shepark
- Support regional compilation #1618 @chaojun-zhang (see the torch.compile sketch at the end of this list)
- Fix text generation quality for bf16 models when sampling #1644 @skavulya
- Readme modification #1700 @libinta
- Fix mpt model generation #1696 @mengniwang95
- Fix lm_eval issue of llama #1606 @sywangyi
- Align diffusers CI tests with examples #1679 @dsocek
- Update audio-classification/requirements.txt to fix numpy version #1717 @hsubramony
- Improve automation for stable-diffusion training scripts in README #1651 @dsocek
- Fix video diffusion black output if --bf16 is set #1685 @sywangyi
- Fix sdxl mlperf time bug #1580 @huijuanzh
- Enabling minimize memory for zero3 runs #1724 @bhargaveede
- Add gated models to diffusers CI tests #1690 @dsocek
- Fix formatting of the kubeVersion range in Kubernetes helm chart #1733 @dmsuehir
- Fix llava/llava next issue when working with AutoProcessor #1674 @sywangyi
- fea(): reworked the 8x hpu skipping strategy #1694 @imangohari1
- Process getting killed while loading data for Llama3.2 90b, 8x #1723 @kalyank007
- Fix: Adjust recipe to fit within QueueComputeScal HBM global memory size limit #1722 @kalyank007
- Add PRC models to test_text_generation_example.py #1695 @wenbinc-Bin
- Added quant config files for new scenarios #1681 @ulivne
- Update README.md - correction in diffusers example #1742 @ramyij
- Update DS config to align with recommended settings #1730 @ckvermaAI
- Add dynamo cache size limit option #1619 @chaojun-zhang
- Resolve 'NoneType' object has no attribute 'gate_proj' error when applying EP in DeepSeek-V2 #1740 @IT-Forrest
- Edit mixtral quantization config file #1739 @dudilester
- Fix the incorrect output of sdxl inpaint #1737 @yuanwu2017
- Supports Bitsandbytes development on HPU #1714 @rsshaik1
- FLAN-T5 has bad performance when using regional compilation #1744 @chaojun-zhang
- Add batch dim idx to support latest deepspeed DistributedAttention #1725 @bhargaveede
- Add the inline_inbuilt_nn_modules option #1617 @chaojun-zhang
- Clean up README examples #1709 @yeonsily
- Accuracy fix for llama3.1-70B in eager/torch.compile mode #1746 @ckvermaAI
- Adjust baselines for a lower number of epochs (improved perplexity, lower throughput) #1748 @emascarenhas
- Change clip-roberta/bridgetower not to use fast_ddp #1749 @jiminha
- Adds requirements.txt to sentence transformers training paraphrases #1753 @pi314ever
- Add requirements.txt to sentence transformer training sts #1754 @pi314ever
- Add diffuser tests for optimized sdxl flow on HPU #1554 @sushildubey171
- Fix the output length in image_to_text test #1751 @sywangyi
- Fix Experts Indexing in MoE for Mixtral: Align experts_max with Number of Available Experts #1755 @deepak-gowda-narayana
- Add requirements.txt to sentence transformers nli example #1767 @pi314ever
- UX code change #1764 @talexjohn
- Enable saving and loading FP8 model #1683 @xin3he
- Update measurements for Stable Diffusion XL #1773 @mkrze
- Add datasets to the requirements for Stable Diffusion training #1782 @yafshar
- Enable wav2vec-large model for speech_recognition test #1783 @jiminha
- Update multi-node-training environment variables for GaudiNIC #1779 @Jianhong-Zhang
- Fixed Gemma2 error when saving pretrain #1781 @kplau1128
- Support llava1.5 lora finetuning. #1487 @lkk12014402
- Fix DeepSeek-V2 expert-parallelism crash due to indexing error #1765 @skavulya
- Update transformer_engine._convert_model to skip LoRA layers #1766 @vivekgoe
- Create Habana_Validated_Models.md to list all the models validated #1778 @hsubramony
- Enable attention selection for wav2vec2 #1757 @ugolowic
- Add --attn_implementation to wav2vec2 slow tests #1788 @ugolowic
- Add sentencepiece to the requirements #1792 @hsubramony
- Fix LoRA weights loading in text-to-image generation sample script #1789 @dsocek
- Add trust_remote_code #1786 @atakaha
- Fix the restart issue for Sentence Transformer STS example in validation #1799 @ZhengHongming888
- Experimental flags for accuracy issues #1795 @hsubramony
- Temporary WA for get_type error #1806 @12010486
- Fix Sentence Transformer STS restart issue #1814 @ZhengHongming888
- Fix broken link for GenerationConfig #1819 @xin3he
- Fix for text-generation, AttributeError: 'GenerationConfig' object has no attribute 'use_fused_rope' #1823 @hsubramony
- Fix dataset_version for ST example requirements.txt #1809 @ZhengHongming888
- Move model to device before wrapping with FSDP #1830 @skaulintel
- Update warmup ratio for adalora #1820 @astachowiczhabana
- Fix for attention selection in wav2vec2 #1836 @ugolowic
- Revert "Lm eval upgraded to 0.4.7 (#1692)" #1837 @astachowiczhabana
- Removing HL_DS_DISTRIBUTED_ATTENTION_SEQ_DIM as it's not needed from 1.20 #1726 @bhargaveede
- Temporary workaround to avoid segmentation fault #1798 @yafshar
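The regional compilation (#1618) and dynamo cache-size (#1619) options above lend themselves to a short illustration. A generic sketch, not the repository's exact implementation (layer layout assumed Llama-like, values illustrative):

```python
import torch

torch._dynamo.config.cache_size_limit = 64  # the knob surfaced by #1619

def compile_regions(model):
    # Regional compilation: compile each repeated decoder block separately
    # instead of the whole model, for smaller graphs and faster recompiles.
    for i, block in enumerate(model.model.layers):
        model.model.layers[i] = torch.compile(block, backend="hpu_backend")
    return model
```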
v1.15.0: SynapseAI v1.19.0, FLUX, Mllama, DeepSeek, Falcon 3
SynapseAI v1.19
FLUX
- FLUX with diffusers 0.31.0 #1450 @dsocek
- FLUX Fine-Tuning for Gaudi #1482 @dsocek
- Flux Image-To-Image pipeline #1524 @dsocek
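The FLUX pipelines follow the usual Gaudi diffusers pattern (use_habana, use_hpu_graphs, gaudi_config). A sketch, assuming the wrapper class is exposed as GaudiFluxPipeline:

```python
import torch
from optimum.habana.diffusers import GaudiFluxPipeline  # class name assumed

pipe = GaudiFluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
    use_habana=True,
    use_hpu_graphs=True,
    gaudi_config="Habana/stable-diffusion",
)
image = pipe("a cat wearing a space suit", num_inference_steps=28).images[0]
```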
New models
- Optimized inference of Cohere model on HPU #1329 @XinyuYe-Intel
- Idefics2 #1270 @sywangyi
- Optimized inference of XGLM model on HPU #1323 @XinyuYe-Intel
- Add mllama support #1419 @sywangyi
- Enable paligemma model for image-to-text example #1407 @kaixuanliu
- Enable Gemma2 Inference on Gaudi #1504 @Luca-Calabria
- Minicpm enabling #1342 @pi314ever
- Enable Falcon-mamba #1480 @yuanwu2017
- Add support for Baichuan2 #1479 @xhaihao
- Enable DeepSeek-V2 #1475 @yao-matrix
- Add chatglm #1478 @mengker33
- Falcon Model Support #1612 @alekseyfa
Various model optimizations
- Enable flash attention for gemma #1454 @atakaha
- Support loading 4 bit Qwen2 #1476 @mengniwang95
- Fixed Gemma FP8 flash_attention lower throughput issue #1510 @kplau1128
- Disable default sdpa in Albert (#22) #1517 @astachowiczhabana
- Implement fused sdpa for wav2vec2 (#18) #1520 @astachowiczhabana
- Memory optimization for gpt_bitcode #1513 @astachowiczhabana
- Support beam search with reuse_cache and bucket_internal #1472 @Wei-Lin-Intel
- Add mixtral trl sft #1349 @lkk12014402
- Enable tiiuae/falcon-11B-vlm in image_to_text example #1490 @sywangyi
- Enable fusedsdpa kernel for vision part of mllama #1531 @sywangyi
- Enable dynamic compile for mpi(training) #1509 @chaojun-zhang
- Add DynamicMoE support for Mixtral #1511 @kwisniewski98
- Implemented fusedSDPA for stable diffusion (#36) #1545 @astachowiczhabana
- Fix Accuracy Calculation Issue in GPT-NeoX #1591 @yafshar
Sentence Transformers
- Update sentence transformer to v3.2.1 #1470 @ZhengHongming888
Textual Inversion XL
TIMM
- Enable pyTorch-IMage-Models (TIMM) with HPUs #1459 @ZhengHongming888
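Running a TIMM model on HPU mostly amounts to importing the Habana PyTorch bridge and moving the model to the hpu device. A sketch (model name illustrative):

```python
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device
import timm

model = timm.create_model("resnet50", pretrained=True).to("hpu").eval()
x = torch.randn(1, 3, 224, 224).to("hpu")
with torch.no_grad():
    out = model(x)
htcore.mark_step()  # flush lazy-mode execution
print(out.shape)
```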
Context Parallelism
- Adding support for Context Parallelism using DeepSpeed's DistributedAttention #1501 @bhargaveede
- Move parallel_state.py to the distributed folder a6ee7c2044e6ddf7d19ae3ad663149e51d6f89e7 @regisss
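#1501 wires DeepSpeed's Ulysses-style DistributedAttention into the attention layers. A rough sketch of the wrapping; the sequence-parallel process group is assumed to come from the relocated parallel_state helpers:

```python
from deepspeed.sequence.layer import DistributedAttention

def wrap_with_context_parallel(core_attention, seq_parallel_group):
    # Each rank holds a slice of the sequence; DistributedAttention all-to-alls
    # activations so every head still attends over the full sequence.
    return DistributedAttention(core_attention, seq_parallel_group)
```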
CI improvements
- Tests for text gen output text #1411 @vidyasiv
- Add split runners to CI (2 devices per runner for fast tests) 72df37df46d1d2a2665c5d1be43b13704b7c8ada @regisss
- Fix fast CI to work with split runners #1534 @regisss
- Add Llama 3.1 ft to CI #1529 @MohitIntel
Documentation
Other
- Fix facebook/hf-seamless-m4t-medium crash #1433 @sywangyi
- Fix bias update in scoped all reduce #1456 @skavulya
- fea(pytests): Added skip for unsupported tests for mistral/mixtral #1462 @imangohari1
- Remove deprecated Mixed precision flags #1471 @vivekgoe
- Readme: replace tabs with spaces #1485 @mgonchar
- Move fast tests to Gaudi2 #1498 @regisss
- Remove torch req from LM example #1491 @astachowiczhabana
- Remove keep_input_mutations #1492 @astachowiczhabana
- Fix trust_remote_code #1493 @astachowiczhabana
- Upgrade ViT README with torch.compile #1494 @astachowiczhabana
- Corrected Throughput measure for GaudiDDPMPipeline #1460 @deepak-gowda-narayana
- [SW-196761] Add G3 in T5-L README #1523 @astachowiczhabana
- Fix tuple object error #1354 @SupreetSinghPalne
- Add warmup time and compile time log for the eval/prediction. #1489 @jiminha
- Add support for MLPERF optimized pipeline from example #1465 @ANSHUMAN87
- Add check_neural_compressor_min_version for 4 bit behavior #1500 @xin3he
- Pass "lazy_mode" arg to GaudiLlamaModel GaudiTrainer #1515 @astachowiczhabana
- Removed workaround for NaN bug causing graph break. #1516 @astachowiczhabana
- text_generation: improve parameters check #1527 @mgonchar
- transformers: fixed some typos #1528 @mgonchar
- Make the profiler's with_stack option configurable #1497 @ranzhejiang
- Fix dtype issue with valid sequence length in torch.compile bs=1 #1532 @wszczurekhabana
- Migrate OH CLIP (roberta-clip) training to torch.compile #1507 @chaojun-zhang
- test_text_generation: fix non-Gaudi2 case #1530 @mgonchar
- text-generation: improve output printing #1486 @mgonchar
- Text-generation, model set-up: torch.compile for attributes instead of models' types #1452 @dsmertin
- Fix bridgetower example #1481 @astachowiczhabana
- Migrate OH Wave2Vec-AC training to torch.compile - README update #1537 @astachowiczhabana
- Migrate OH T5-large training to torch.compile #1506 @chaojun-zhang
- trainer: fixed spelling #1538 @mgonchar
- Create CI Eager/Lazy for Language Modeling #1448 @Luca-Calabria
- Fixes for llava-next test failures in 1.19 #1535 @tthakkal
- Refactor Qwen2 Family #1541 @Wei-Lin-Intel
- Add support for optimized SDXL pipeline #1519 @sushildubey171
- Add the checkout parameters of falcon-mamba pytest #1540 @yuanwu2017
- Avoid negative values in eval metrics #1533 @deepak-gowda-narayana
- Fix lm_eval script for starcoder and gemma #1463 @skavulya
- Add option to use bf16 in PT sdp (#5) #1514 @astachowiczhabana
- Fix tests.test_peft_inference failure #1543 @sywangyi
- Update lm_eval version #1473 @alexey-belyakov
- Fix bad import in Baichuan code #1547 @regisss
- Restore performance in generate #1546 @ugolowic
- Fix for llava models not generating text with test failures in 1.19 #1548 @tthakkal
- Refactor KV cache, Rope , reduce common code #1148 @abhilash1910
- Adjust Qwen2-7B test case #1551 @Wei-Lin-Intel
- [run_lm_eval.py] Fixed too many print dump json info #1553 @FocusLuo
- Fix for single_card llama7b and falcon40b CI errors #1549 @MohitIntel
- Apply --sdp_on_bf16 to image-to-text examples #1557 @schoi-habana
- Fix accuracy regression in Gemma #1556 @skavulya
- Fix FusedSDPA wrapper from TransformerEngine #1562 @pbielak
- Run albert-xxlarge-v1 CI as torch.compile mode #1563 @yeonsily
- Update README commands for the models to use --sdp_on_bf16 #1566 @yeonsily
- Minicpm patch #1567 @pi314ever
- Updated gemma_2b_it CI #1561 @Luca-Calabria
- Fixed Adalora Test for OH 1.15 #1564 @npiroozan
- Fixed LORACP Test for OH 1.15 #1568 @npiroozan
- Fix prefix llama ci failure #1570 @sywangyi
- Fix mllama test #1569 @sywangyi
- Fix lazy_mode assignment #1558 @vidyasiv
- Generation utils update (minor) #1468 @yafshar
- Style: removed tabs #1577 @mgonchar
- Enable num_return_sequences in beam search #1536 @mengker33
- gpt_bigcode: added internal bucketing fix #1526 @mgonchar
- Update the Gaudi trainer with transformers 4.45.2 #1398 @yafshar
- Revert "add check_neural_compressor_min_version for 4 bit behavior" #1578 @xin3he
- Revert PR #1473 #1582 @regisss
- Fixed spelling #1576 @mgonchar
- Update docs for baichuan2 training #1586 @xhaihao
- Add WA flag for falcon-180b to resolve text-gen critical reset error during tests #1590 @hchauhan123
- Update transformers tests generation util v4.45.2 #1441 @malkomes
- Limit position embeddings in inference #1598 @bhargaveede
- Verify model output is provided when check_output is enabled #1597 @vidyasiv
- Update README.md #1595 @skaulintel
- Fix scikit-learn to 1.5.2 to fix f1 evaluation crash in 1.6.0 #1596 @sywangyi
- Update language-modeling README file #1599 @vivekgoe
- Revert common KVCache not to check token_idx #1594 @jiminha
- Revert LlamaKVCache due to memory increase #1605 @jiminha
- Replace the UNET custom attention processors #1608 @yafshar
- Fix run_generation test commands for TRL out usage example #1621 @shepark
- Update sdp_on_bf16 option for ST example #1615 @ZhengHongming888
- Update save lora weights for diffusers with text_encoder_2 layers #1626 @skavulya
- Fix save_lora_weights in pipeline_utils.py #1643 @regisss
- Check rope_scaling attr #1609 @jiminha
- Skip certain tests for G1 with empty param list #1613 @hsubramony
- Revert "Update transformers tests generation util v4.45.2 (#1441)" #1614 @yeonsily
- Audio classification readme update #1604 @hsubramony
- Fix readme cmds for clip-roberta #1603 @hsubramony
- Add arbitrary scales #1625 @jiminha
- Modify Qwen2 TRL command to avoid OOM. #1630 @jiminha
- Fix distributed issue for ST Trainer #1649 @ZhengHongming888
- Fix distributed issue for timm #1653 @ZhengHongming888
- Refactor mixtral moe block. #1635 @lkk12014402
- Speech-recognition: downgrade datasets version #1646 @hsubramony
- Add sdp_on_bf16 to controlnet #1631 @skaulintel
- Quick fix for quantization/custom op list loading #1657 @dsocek
- Fix bug for GaudiMixtralAttentionLongSequence forward #1650 @kaixuanliu
v1.14.1: Patch release
- Enable DeepSpeed for image-to-text example #1455 @schoi-habana
- Fix bug when loading 4bit checkpoint quantized in INC #1447 @xin3he
- Fixes 'Tokenizer does not have padding token' introduced by #1444 for Llama3.1 #1457 @MohitIntel
Full Changelog: v1.14.0...v1.14.1
v1.14.0: Transformers v4.45, SynapseAI v1.18, Qwen2-MoE, text-to-video generation
Transformers v4.45
SynapseAI v1.18
Qwen2-MoE
Text-to-video generation
- Enabling Text to Video Diffusion Model Generation #1109 @pi314ever
- Porting Stable Video Diffusion ControlNet to HPU #1037 @wenbinc-Bin
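A sketch of the text-to-video flow from #1109, following the usual Gaudi diffusers pattern (the pipeline class name is assumed):

```python
import torch
from optimum.habana.diffusers import GaudiTextToVideoSDPipeline  # name assumed

pipe = GaudiTextToVideoSDPipeline.from_pretrained(
    "ali-vilab/text-to-video-ms-1.7b",
    torch_dtype=torch.bfloat16,
    use_habana=True,
    use_hpu_graphs=True,
    gaudi_config="Habana/stable-diffusion",
)
frames = pipe("a panda playing guitar", num_inference_steps=25).frames
```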
Depth-to-image generation
- Depth to Image Generation #1175 @pi314ever
Model optimizations
- Enable FusedSDPA for Mpt #1101 @Jianhong-Zhang
- Mixtral fp8 #1269 @imangohari1
- Prevent Graph break in Llama when using flash attention #1301 @pramodkumar-habanalabs
- Boost SDXL speed with initialized schedule step reset #1284 @dsocek
- Improve MPT fp8 #1256 @atakaha
- Add Whisper static generation #1275 @Spycsh
- Gemma: enabled HPU Graphs and Flash Attention #1173 @dsmertin
- Recommend jemalloc for gpt-neox-20b 8x #1350 @hsubramony
- Optimized inference of GPT-NEO model on HPU #1319 @XinyuYe-Intel
- Fix graph breaks for BART in torch.compile mode. #1379 @astachowiczhabana
- Gpt_bigcode: added internal_bucketing support #1218 @mgonchar
- refine bucket_internal for mpt #1194 @Jing1Ling
- Qwen finetuning bucketing #1130 @ssarkar2
- Enable FusedSDPA fp8 in Llama FT #1388 @pbielak
- Added gemma specific fp8 quantization file #1445 @yeonsily
Intel Neural Compressor
- Enable INC for llava models and change softmax to use torch.nn.functional.softmax, as it is a module supported by INC #1325 @tthakkal
- Load INC GPTQ checkpoint & rename params #1364 @HolyFalafel
- Fix INC load weights compile error due to Transformers 4.45 upgrade #1421 @jiminha
Vera/LN-tuning
Other
- Add callable workflow to post comments when code quality check failed #1263 @regisss
- Fix failed code quality check comment workflow #1264 @regisss
- Accelerate Diffusers CI #1265 @regisss
- Add profiler to SD3 #1267 @atakaha
- Fix profiling step with device finish execution for text-generation #1283 @libinta
- Update FusedSDPA calling method as Gaudi documentation #1285 @yeonsily
- Switch failed code quality check comment to workflow_run #1297 @regisss
- Potential fix for the failed code quality check comment workflow #1299 @regisss
- Fix text-generation example lm_eval evaluation #1308 @changwangss
- Add section to README about Transformers development branch #1307 @regisss
- Fix eager mode in run_generation by removing graph logs #1231 @Vasud-ha
- Fix bug when running google/paligemma-3b-mix-224 #1279 @kaixuanliu
- Use native checkpointing under compile mode #1313 @xinyu-intel
- fixed fused_qkv object AttributeError due to 'LlamaConfig' #1203 @rkumar2patel
- Image to Image Generation Enabling #1196 @pi314ever
- Diffusers timing #1277 @imangohari1
- Fix eos issue in finetune/generation #1253 @sywangyi
- Update CI, tests and examples #1315 @regisss
- Fix Sentence Transformer HPU graphs for training with PEFT model #1320 @nngokhale
- Fix ZeroDivisionError in constrained beam search with static shapes #1317 @skavulya
- Update esmfold model not to use param_buffer_assignment #1324 @jiminha
- Falcon inference crash fix for falcon-40b model #1161 @yeonsily
- Add --use_kv_cache to image-to-text pipeline #1292 @KimBioInfoStudio
- Trl upgrade #1245 @sywangyi
- Fix uint4 url typo. #1340 @kding1
- Use eager attention for wav2vec2 #1333 @skaulintel
- Add _reorder_cache back to Llama for HPU #1233 @jiminha
- SDXL CI script throughput #1296 @imangohari1
- Add image so that transformers tests can run #1338 @skaulintel
- Fixes the no attribute error with the falcon multicard test #1344 @mounikamandava
- Add profiler to sdxl mlperf pipeline #1339 @Jianhong-Zhang
- Fix decoder only generation #948 @tjs-intel
- Upgrade gradient checkpointing #1347 @yafshar
- Run_generation example: fixed graph compilation statistics reporting #1352 @mgonchar
- Fix deepseeed crash with Sentence Transformer Trainer #1328 @nngokhale
- fea(ci): reduced slow test_diffusers timing. minor fixes #1330 @imangohari1
- Flash attn args for GaudiGemmaForCausalLM #1356 @kkoryun
- Transformer models generation supports user-provided input embeddings #1276 @zongwave
- Fixed the expected values after for img2img slice #1332 @imangohari1
- Gpt_big_code: make flash attention impl quantization friendly #1282 @mgonchar
- Fix OOM when inference with llama-3.1-70b #1302 @harborn
- Fix the conditional #1362 @yafshar
- Revert "use native checkpointing under compile mode" #1365 @xinyu-intel
- Remove repetitive pip install commands #1367 @MohitIntel
- Minor UX enhancement #1373 @MohitIntel
- Fix bug when running image-to-text example #1371 @kaixuanliu
- Gpt_bigcode: fixed wrong indentation #1376 @mgonchar
- Support for transformers without self.model to torch.compile #1380 @astachowiczhabana
- Only pass the use_kv_cache True to generator #1366 @yafshar
- Clean up the code and remove unnecessary class #1382 @yafshar
- Add the diffusers examples of inference Tech #1244 @yuanwu2017
- Enhance transformers test suite in Optimum-habana-4.43.4 Auto pr 07654de #1387 @rkumar2patel
- Enhance transformers test suite in Optimum-habana-4.43.4 (auto PR 8926a4b) #1386 @rkumar2patel
- Add README.md for Sentence transformer examples with HPU device #1355 @ZhengHongming888
- Change Falcon/GPT-Neox rotary embedding function to use seq_len for #1368 @yeonsily
- Enhance Optimum-habana as per transformers-4.43.4 #1381 @rkumar2patel
- CI fix - Install stable-diffusion reqs #1389 @vidyasiv
- Fix error caused by uninitialized attn_weights #1391 @hsubramony
- Replace flash attention flag #1393 @skaulintel
- Fix DeepSpeed CI on Gaudi2 #1395 @regisss
- Truncate the cached max seq len #1394 @astachowiczhabana
- Fix gpt-neox training accuracy issue. #1397 @yeonsily
- Simplify HQT config files #1219 @Tiefen-boop
- unify_measurements.py script support to unify PCQ 70B 8x #1322 @Yantom1
- Add misc. training args #1346 @SanityRemnants
- Add quantization config for low bs case #1377 @ulivne
- Remove HQT from OHF #1257 @Yantom1
- Valid sequence length for sdpa #1183 @ssarkar2
- Multiple fixes (dynamo graph break, qwen-moe, multicard) #1410 @ssarkar2
- Change the image path for transformers tests back to the correct location #1401 @skaulintel
- Fix Gaudi2 regression tests #1403 @regisss
- Reverting some of transformer pytest funcs/values #1399 @imangohari1
- Fix StarCoder2 inference #1405 @regisss
- Change the order for test_diffusers #1406 @hsubramony
- Fix llama model text generation error #1402 @zongwave
- Datasets downgrade version to 2.21.0 #1413 @hsubramony
- Update ci sentence_transformer.sh #1424 @ZhengHongming888
- Update language-modeling README.md, add trust_remote_code for flan-t5-xl #1422 @hsubramony
- Update unify_measurements.py support info #1425 @shepark
- Fix GPT_neox incorrect output with batch query #1358 @Jianhong-Zhang
- Fix text-to-image example #1429 @regisss
- Add flag to run inference with partial dataset #1420 @pramodkumar-habanalabs
- Add peft generation example #1427 @sywangyi
- Added missing allocate_kv_cache() call in CausalLM class #1431 @yeonsily
- Fix merge error and update text-to-speech readme #1436 @hsubramony
- Fix OOM error for code llama #1437 @jiminha
- Fix error on 4bit checkpoint load with run_lm_eval on TF4.45.2 #1439 @jiminha
- GPT2 torch.compile fix #1434 @dsmertin
- Update text-gen README.md to add auto-gptq fork install steps #1442 @hsubramony
- Fix scoped linear all-reduce for starcoder model #1432 @skavulya
- Fixed recursion error in SentenceTransformer #1428 @yafshar
- Fix Llama 3.1 generation #1444 @regisss
- Remove cache folder from image data folder #1446 @shepark
v1.13.2: Patch release
Llava(-next) improvements
This patch release adds multi-card support for Llava(-next) and enables users to turn on/off recomputing for flash attention.
- Llava: Added flash_attention_recompute arg to provide an option to enable/disable recompute #1278 @tthakkal
- Add the deepspeed injection_policy of mistral #1309 @yuanwu2017
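A sketch of the new toggle from #1278 (kwarg plumbing assumed; model and image illustrative):

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()
name = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(name)
model = LlavaForConditionalGeneration.from_pretrained(name, torch_dtype=torch.bfloat16).to("hpu")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text="USER: <image>\nWhat is shown? ASSISTANT:", images=image, return_tensors="pt").to("hpu")

out = model.generate(
    **inputs,
    max_new_tokens=64,
    use_flash_attention=True,
    flash_attention_recompute=True,  # set False to keep activations instead of recomputing
)
print(processor.decode(out[0], skip_special_tokens=True))
```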
Full Changelog: v1.13.1...v1.13.2
v1.13.1: Patch release
Fixed memory regressions
- Remove _expand_inputs_for_generation for greedy search (#1266) @libinta
- Fix memory regression for modeling llama (#1271) @libinta
FSDP
FSDP checkpoint saving is fixed.
Known limitations
- ESMFold does not work on Gaudi1, this will be fixed in a future version
Full Changelog: v1.13.0...v1.13.1
v1.13.0: Stable Diffusion 3, Sentence Transformers, SAM, DETR, Kubernetes example
SynapseAI 1.17
- Upgrade SynapseAI version to 1.17.0 #1217
Transformers 4.43
Diffusers 0.29
Stable Diffusion 3
Training with Sentence Transformers
- Enable Sentence Transformer Trainer with Gaudi #1111 @ZhengHongming888
Model optimizations
- Fix starcoder2 accuracy issue and optimize performance with fused rope #1095 @mandy-li
- Enable FusedRoPE using float32 for gpt-neox model #1104 @yeonsily
- Mamba initial enablement. #1122 @libinta
- Adding fused qkv support along with config #1102 @bhargaveede
- Enhance Qwen2 with fastsoftmax and bf16 RoPE and cache optimization #1087 @Zhiwei35
- Enable fp8 inference for Llava-Next and add Fused_SDPA #1120 @tthakkal
- Support bucket_internal for MPT #1137 @pk1d3v
- Enable Flash Attention (Fused SDPA) for Starcoder #1114 @abhilash1910
- gpt_bigcode: added FusedSDPA kernel #1138 @mgonchar
- Enable torch.compile for Granite20B #1185 @dvarshney-habana
- Refine use cache for mpt model #1158 @Jing1Ling
- GPT-J support reuse_cache #1094 @atakaha
- Use fast softmax only on prefill #1159 @jaygala223
- Starcoder2 : KVCache and flash attention (FusedSDPA) enablement #1149 @abhatkal
- Gpt bigcode fused sdpa #1260 @yeonsily
SAM, FastVIT, VideoMAE, OpenCLIP, DETR, Table Transformer, deciLM
- Add an example of Segment Anything Model [Inference] #814 @cfgfung
- Add an example of FastViT model (Inference) #826 @cfgfung
- VideoMAE Model Enabling and Examples #922 @pi314ever
- OpenCLIP sample for visual question answering #977 @vidyasiv
- Enabled DETR (Object Detection) model #1046 @cfgfung
- Table transformer enabling #978 @pi314ever
- deciLM support #1133 @sywangyi
Stable Diffusion inpainting, unconditional image generation
- Add the Stable diffusion inpaint support #869 @yuanwu2017
- Enable Unconditional Image Generation on Gaudi 2 [Diffuser/Tasks] #859 @cfgfung
Text feature extraction example
- Feature extraction enabling #994 @pi314ever
Tensor parallelism
- Tensor parallel distributed strategy without using deepspeed #1121 @kalyanjk
- Disable torch.compile for all_reduce when parallel_strategy is set to "tp" #1174 @kalyanjk
Kubernetes cluster example
- Adds a helm chart, dockerfile, and instructions for running examples using a Kubernetes cluster #1099 @dmsuehir
- Fix PyTorch version in the Kubernetes docker-compose to match image #1246 @dmsuehir
FP8 training
- TE FP8 integration #1096 @sanjucsudhakaran
Other
- Updates run_lora_clm.py with enhanced dataset support #955 @dmsuehir
- Fix prefix tuning finetune issue and update test #975 @sywangyi
- Fix throughput calculation in image-to-text example #1070 @regisss
- SDXL-trainig: fixed ci, changed gated dataset, fixes for non-square datasets #1038 @imangohari1
- Updating batch_size of Albert-XXL in README #1063 @vineethanandh
- Fix the error of running run_pipeline.py of text_generation example #1055 @yuanwu2017
- Add a test for llama finetuning with FP8 precision #1106 @sanjucsudhakaran
- Beam-search fix #1113 @ssarkar2
- Add chat format support dataset in SFT #1066 @libinta
- Fix nan loss of gemma and crash if dataset_concatenation is not set #1088 @sywangyi
- torch.compile: keep input mutation in graph, avoiding unnecessary memcpy #1069 @sushildubey171
- Updated langchain text-generation pipeline to work with latest release 0.2.5 #1084 @rbrugaro
- Add the MC example #891 @yuanwu2017
- Fix recompiles if limit_hpu_graph is False #1129 @ssarkar2
- Update examples batchsize in README #1123 @shepark
- Fix OOM error in SDXL Fine-Tuning validation stage #1134 @dsocek
- Added an example code to demonstrate how to use deterministic image generation #878 @cfgfung
- SD image variation/InstructPix2Pix/StableDiffusionXLImg2ImgPipeline pipeline #988 @sywangyi
- Add ci test for trl rewarding and ppo, fix backward failure in ppo caused by rmsfusion #1020 @sywangyi
- Llama adapter #983 @sywangyi
- torch.flip issue is fixed in SynapseAI 1.16, so remove the WA #1092 @sywangyi
- Fix test CausalLanguageModelingLORAExampleTester KeyError #1139 @dmsuehir
- fix(ci): new runs-on #1136 @XciD
- Add trust_remote_code for loading datasets in the audio classification example #1074 @regisss
- Generation example: print number of warmup iterations #1145 @mgonchar
- CI Updates: text-gen to receive ranks/bs, Updated bs/metric for baselines #1140 @imangohari1
- Support for custom files for run_lora_clm.py #1039 @vidyasiv
- Change the device_id for FSDP plugin #1086 @ckvermaAI
- Set KV Cache update as static method #1160 @ulivne
- To fix CPU tensor issue #1157 @mkumargarg
- Adding missing __init__.py to mistral and mixtral test package #1188 @rkumar2patel
- Add example of multitask_prompt/poly tuning #915 @sywangyi
- Fix data-type mismatch for mlperf_inference accuracy test #1146 @kalyanjk
- Fix spawn MP context, limit cpu and download data #1131 @polisettyvarma
- T5 multi card #1222 @yafshar
- Add trust_remote_code for t5 poly-tuning test #1220 @yafshar
- Resolve "empty tensor optional" error with hpu_graphs + kv cache for StarCoder #1181 @vidyasiv
- Fix VIT, add wav2vec comment #1223 @ssarkar2
- Roberta tests were running on CPU #1229 @ssarkar2
- Fix bert/roberta contrastive search tests #1226 @skavulya
- Remove the default env variable to trust remote code by default #1225 @yafshar
- Improve style check workflow #1230 @regisss
- Added scheduler selection for SDXL fine-tuning #867 @kplau1128
- Clear help msg for ignore_eos to avoid misunderstanding @sywangyi
- Support loading hugging face checkpoint #1165 @ulivne
- Change triggering event for code style check #1238 @regisss
- gptj: fix missing token_idx #1234 @envsp
- fix(nltk): fixed the version to working one #1247 @imangohari1
- Updating to avoid hardcoding tests in CI framework #1221 @vidyasiv
- Fix FSDP graph error due to Transformers 4.43 update #1251 @jiminha
- Fix SD README commands #1250 @imangohari1
- Fix spelling errors #1252 @changwangss
- Set HLS_MODULE_ID only if it wasn't set previously #1254 @astachowiczhabana
- Fix overflow of steps in SDXL for default diffusers scheduler @dsocek
- fix(test_diffusers): automated the checking for tests without upstream HF #1232 @imangohari1
- fix(nltk): Revert 1247. Updated the version. added the punkt_tab download #1258 @imangohari1
- Set input_embeds before it gets used #1261 @tthakkal
- Update README and more changes, rebase to main #1259 @shepark
Known limitations
- For Llama, some big batch sizes lead to out-of-memory errors whereas they used to work
v1.12.1: Patch Release
Fix 1st token latency time measure
Fix for Mixtral
- Mixtral typo fix #1107 @schoi-habana
Other
Full Changelog: v1.12.0...v1.12.1
v1.12: Qwen2, Gemma, SVD, Dreambooth, speculative sampling
SynapseAI v1.16
Transformers 4.40
Speculative Sampling
- Speculative sampling on Gaudi using Optimum-Habana #973 @nraste
- Fix assisted decoding generation error #1080 @libinta
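Speculative (assisted) decoding pairs the target model with a small draft model through the standard Transformers API; a sketch on HPU (models illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()
tok = AutoTokenizer.from_pretrained("facebook/opt-6.7b")
target = AutoModelForCausalLM.from_pretrained("facebook/opt-6.7b", torch_dtype=torch.bfloat16).to("hpu")
draft = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", torch_dtype=torch.bfloat16).to("hpu")

inputs = tok("The theory of relativity states", return_tensors="pt").to("hpu")
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```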
Model optimizations
- Add --bucket_size support for gpt_bigcode #802 @jiminha
- Optimize StableLM model inference #805 @XinyuYe-Intel
- Enable google/gemma-7b. #747 @lkk12014402
- Enable llava static generation. #767 @lkk12014402
- Fix perf drop in flan-t5 summarization #908 @MohitIntel
- Enable Qwen2 model #774 @XinyuYe-Intel
- Extend bucket_internal to SAMPLE generation mode #819 @xt574chen
- SpeechT5 static consistent dropout #824 @Spycsh
- Optimize inference of Persimmon model #822 @XinyuYe-Intel
- Enable OWL-ViT graph mode on Gaudi platform #783 @cfgfung
- Support mixtral kvcache reuse and remove kv_cache_fp8 #898 @jychen21
- Add fp8 related changes to mistral for text-generation #918 @skaulintel
- Optimization for phi series models: support fp8 kv cache and reuse kv cache #902 @yuwenzho
- Support Mistral 32K input token #931 @jiminha
- Support mixtral long sequence 32k with bs 4 #903 @jychen21
- Adapt Mixtral long sequence handling for Mistral #985 @jiminha
- Fix performance issue in mistral #1030 @jiminha
- Optimized inference of Starcoder2 model #829 @XinyuYe-Intel
- Add support for IBM Granite #1045 @regisss
- Enable fp8 inference for Llava-hf 7B and 13B in 1.16 release #951 @Luca-Calabria
- Fusedrope inp bf16 #1026 @ssarkar2
- Enhance Qwen2 model with FSDPA and bucket #1033 @Zhiwei35
- Optimize seamless-m4t/vits model for text-to-speech generation #825 @sywangyi
- cache_optimization #1028 @ssarkar2
- Ensure KV cache is not returned as output tensor during decode phase for Falcon #993 @schoi-habana
- Fast softmax #972 @wszczurekhabana
- Falcon optimization #974 @libinta
- Quantization for FSDPA #976 @dudilester
- Falcon update park #1052 @ssarkar2
- Add the Llava_next support #1041 @yuanwu2017
- Improve torch compile performance #1082 @libinta
Stable Video Diffusion
PEFT
- Add ia3 and adalora support #809 @sywangyi
- Enable prompt tuning/prefix tuning/p tuning clm and example #758 @sywangyi
TRL
Object Segmentation Example
Dreambooth
Others
- Text generation pipeline: Extended functionality to align with run_generation script #782 @mgonchar
- Enable clip mediapipe and update G2 baseline #856 @MohitIntel
- Add ci test for SFT and DPO #857 @sywangyi
- Fix SFT, DPO CI on Gaudi1 #893 @regisss
- Add SDXL in README #894 @regisss
- Fix falcon 180b oom issue if peft > 0.6.2 #895 @sywangyi
- Enabled additional models in CI #879 @MohitIntel
- Add static shape support for vision_encoder_decoder generation if decoder supports static shape #834 @sywangyi
- Add HabanaProfile to Stable Diffusion and XL #828 @atakaha
- Pytest accuracy updates for Falcon, T5, GPT2 #916 @Luca-Calabria
- Update text-generation readme with torch.compile info. #884 @libinta
- Update Wav2Vec2ModelTest::test_initialization #919 @malkomes
- Add linear and dynamic RoPE to Mistral and Mixtral #892 @regisss
- Fix for wav2vec2 test cases #923 @lqnguyen
- Add nograd() to prevent backward backend #897 @astachowiczhabana
- Assisted decoding not implemented #910 @tjs-intel
- Disable wav2vec2 symbolic tracing test #904 @tjs-intel
- Add support for symbolic tracing of GPT2 models #913 @tjs-intel
- Utils: return a more reasonable error when attempting to load a non-PyTorch model #921 @mgonchar
- Pytest accuracy updates for Bridgetower, Swin, Vit #927 @Luca-Calabria
- Text generation: added langchain pipeline script #887 @mgonchar
- Fix for AST models #914 @vidyasiv
- Fix AttributeError for wav2vec test #929 @Jianhong-Zhang
- Fix ValueError for test_summarization #939 @Jianhong-Zhang
- Grad norm tensor fix #938 @yeonsily
- Add information to the audio-classification examples README about --ddp_find_unused_parameters parameter #941 @Alberto-Villarreal
- Add leaderboard link #947 @echarlaix
- Fix formatting of arg parse help strings in the PEFT example #944 @dmsuehir
- Use new Habana llama and falcon model configs #940 @skaulintel
- Update based on legal requirements. #900 @libinta
- Update test generation config to raise ValueError #949 @malkomes
- Add --trust_remote_code for text generation examples #870 @yangulei
- Added Llama-2 fp8 text-generation test cases #934 @yeonsily
- Upgrade SD output image verification with CLIP score #920 @MohitIntel
- Llama Guard for text classification example #871 @dsmertin
- Update README logo #950 @regisss
- Add Gaudi CI for Sentence Transformers #928 @regisss
- Get iteration times through generate() #899 @hsubramony
- Update speech recognition seq2seq example #953 @regisss
- Fix wrongly all_gather for mixtral finetune #965 @ccrhx4
- Add intel-mila protST example #860 @sywangyi
- Small CI refacto #968 @regisss
- Llama70b one card to infer device map with max memory limitation #963 @Yantom1
- Map list to tensors #926 @ssarkar2
- Fix fsdp lora torch compile issue #971 @sywangyi
- Fix for the simulate_dyn_prompt flag assertion #984 @alekseyfa
- Initial enablement with FP8 Training (port from OHF #91) #936 @libinta
- Warn user when using --disk_offload without hqt #964 @Yantom1
- Assign grad_norm for logging only if it's a single element tensor #992 @yeonsily
- Update examples #998 @regisss
- Fix warmup for diffusers when batch size < throughput_warmup_steps #960 @dsocek
- Add torch.compile instructions for Roberta-Large #981 @MohitIntel
- Fix gpt_neox, stablelm inference regression caused by RoPE dtype #999 @mandy-li
- fea(examples): Updated the READMEs with requirements.txt installation #1000 @imangohari1
- Initial commit for fp8 CI #995 @yeonsily
- Fixed 'MixtralConfig' object has no attribute 'rope_scaling' #1009 @aslanxie
- Use the length of timesteps as the number of inference steps #986 @yuanwu2017
- Fix the bug of output_type=np or latent. #996 @yuanwu2017
- Fix wav2vec test load adapter #937 @malkomes
- Mark scale as const and remove --fp8 flag usage #962 @Yantom1
- Add per step time collection to other methods #1004 @ssarkar2
- Fix first token time #1019 @ssarkar2
- Fix text-generation example #1025 @regisss
- Updates test_beam_search to transformers_4.40 #1017 @malkomes
- Fix eos problem #1034 @sywangyi
- fp8 textgen ci structure update #1029 @jiminha
- Fix a return value issue caused by PR 973 #1040 @yafshar
- Add no_checks for sub dataset in lvwerra/stack-exchange-paired since it does not contain test split #1003 @sywangyi
- Readme Update for FSDP #980 @hlahkar
- Add unifier script and disk offload flag usages to README. #1023 @libinta
- Add mixtral for meta device load due to mixtral-8x22b model size #909 @libinta
- Update unifier script #1010 @Yantom1
- Update text-generation CI configuration for falcon and Mixtral #1044 @yeonsily
- Update multi-node README to check ssh connection issue #1048 @yeonsily
- Infra upgrade workflows #480 @glegendre01
- Update test_text_generation_example.py #1051 @ssarkar2
- BERT training migrated to torch.compile #990 @ANSHUMAN87
- Update test_examples.py #1053 @ssarkar2
- Update modeling_llama.py: deepspeed fix for codellama #1054 @ssarkar2
- No shapes in profilings by default #1050 @astachowiczhabana
- Change the way to unset environment variable for gpt-neox ci #1060 @yeonsily
- Update README for Albert torch.compile mode #1061 @MohitIntel
- Fix lm_evaluation_harness to specific commit (#240) #1064 @astachowiczhabana
- Fix text-generation example README.md #1081 @shepark